AI has the potential to touch and transform all aspects of our lives, and innovations built on AI are emerging today across a wide range of industries. These industries use AI to improve productivity, support consumer decision-making, and enhance the education experience. Running the complex AI workloads that deliver these results requires significant compute and data center power.
Today’s data centers already consume a great deal of power, and that consumption will only grow as AI deployments broaden and the underlying foundation models get larger. Arm is addressing this challenge by enabling additional AI capacity without adding to the energy problem. As generative AI and foundation models have gained popularity, deployments have been constrained by the limited availability of specialized compute hardware and its associated high cost. Larger models are also more resource-intensive, which exacerbates the problem. The rise of smaller language models and techniques such as quantization is encouraging developers to consider CPUs for machine learning. Smaller models are efficient and can be tailored to narrower, more specific applications, making them practical and cost-effective to deploy.
Arm’s latest Neoverse-based CPU platforms offer high-performance, power-efficient processors for cloud data centers. Arm Neoverse gives cloud providers the flexibility to customize their silicon and optimize their software and systems for the most demanding workloads, all while delivering leading performance and power efficiency. This is why all major cloud providers have adopted Arm Neoverse technology to design compute platforms that address developers’ needs across a wide range of cloud workloads, including AI and ML.
Popular open-source models from Hugging Face run efficiently and performantly on CPUs. Deploying models can be a time-consuming and challenging task, often requiring deep expertise in ML and in the underlying model code. Hugging Face pipelines abstract this complexity away and let developers use any model from the Hub for inference. Developers building AI applications and projects can benefit from the ease of provisioning cloud infrastructure, and from the power efficiency and cost savings associated with Arm-powered cloud instances.
CPUs have long benefited from using a single instruction to process multiple data points simultaneously, a technique known as SIMD, which provides data-level parallelism and performance gains. Arm Neoverse CPUs support advanced SIMD technologies such as NEON and SVE, which can accelerate common algorithms used in HPC and ML.
GEMM (General Matrix Multiplication) is an essential algorithm in machine learning that multiplies two input matrices to produce one output matrix. The Armv8.6-A architecture adds SMMLA and FMMLA instructions that perform these multiplications on a 2- or 4-wide array at a time, reducing fetch cycles by 2x to 4x and compute cycles by 4x to 16x. These instructions are implemented in several Arm-based server processors, including AWS Graviton3 and Graviton4, NVIDIA Grace, Google Axion, and Microsoft Cobalt.
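As a point of reference, the matrix multiplication that GEMM performs maps directly onto a PyTorch matmul. The sketch below is purely illustrative: the shapes are arbitrary, and whether a given call is actually lowered to MMLA-based kernels depends on the PyTorch build and the fast-math setting described later in this post.

import torch

# Two input matrices; shapes are arbitrary and chosen only for illustration
a = torch.randn(256, 512)
b = torch.randn(512, 128)

# A single GEMM: multiply the two inputs to produce one output matrix
c = a @ b
print(c.shape)  # torch.Size([256, 128])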
These key features benefit machine learning across many use cases.
With these ML inferencing capabilities, Arm Neoverse based AWS Graviton3 processors have achieved up to 3x better performance compared to previous-generation AWS Graviton2 processors. Let’s dive into a sentiment analysis use case.
Sentiment analysis is a vital AI technique that figures out emotions and opinions from written text. Businesses use it to grasp what customers think, evaluate how people perceive their brand, and shape marketing decisions. But running sentiment analysis models efficiently can be demanding on computational resources. This blog post dives into how Arm Neoverse CPUs can speed up sentiment analysis, resulting in quicker and more impactful AI-driven insights.
Specifically, we are going to focus on speeding up NLP PyTorch models (BERT, DistilBERT, and RoBERTa) on Arm Neoverse CPUs using the default PyTorch package available on pytorch.org. We will use the Hugging Face Transformers sentiment analysis pipeline to run these models.
Hugging Face Transformers simplify the use of their pre-trained models with a powerful tool called pipelines. These pipelines handle complexities behind the scenes, allowing you to focus on solving the actual problem.
For instance, if you want to analyze the sentiment of a piece of text, just input it into the pipeline. It will return a sentiment classification (positive or negative) without you having to worry about model loading, tokenization, or other technical details.
This bit of code uses the pipeline class to check how people feel about the input text. Behind the scenes, it uses a ready-made model from the Hugging Face Model Hub.
Code:
from transformers import pipeline

pipe = pipeline("sentiment-analysis")
data = ["I like the product a lot", "I wish I had not bought this"]
pipe(data)
Output:
[{'label': 'POSITIVE', 'score': 0.9997499585151672}, {'label': 'NEGATIVE', 'score': 0.9996662139892578}]
You can also specify a model of your choice using the model parameter.
pipe = pipeline("sentiment-analysis", model="distilbert-base-uncased")
When adding sentiment analysis to your existing application, it's important to consider latency. For real-time use cases, a response time of less than 100 milliseconds is typically perceived as instantaneous. However, higher latency may be acceptable for your specific needs.
We took two reviews, a short review (32 tokens when tokenized with BertTokenizer) and a long review (128 tokens when tokenized with BertTokenizer), and benchmarked them on AWS Graviton2 (c6g) and AWS Graviton3 (c7g).
Both AWS Graviton2 (c6g) and AWS Graviton3 (c7g) meet the ideal real-time latency target of 100 ms for short-review sentiment analysis with just 4 vCPUs, as can be seen in the graph below.
AWS Graviton3 (c7g) with BF16 enabled can also meet the ideal real-time latency target for longer-review sentiment analysis with 4 vCPUs. Arm Neoverse V1 based c7g instances provide up to a 3x performance boost compared to previous-generation c6g instances (based on Arm Neoverse N1 CPUs).
We conducted the benchmark tests on the following AWS EC2 instances:
c6g.xlarge
c7g.xlarge
Both instances have 4 vCPUs. We set them up with the same software and followed these setup steps.
sudo apt-get update
For further details on the installation process, refer to https://learn.arm.com/install-guides/pytorch/
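After installation, a quick sanity check from Python confirms that the build is working. This is a minimal sketch; the version printed will depend on what the install guide provides at the time you follow it.

import torch

print(torch.__version__)                      # confirm the installed PyTorch version
print(torch.randn(2, 3) @ torch.randn(3, 2))  # run a small matmul to confirm basic ops work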
Arm PyTorch Installation Guide (https://learn.arm.com/install-guides/pytorch/) and PyTorch Inference Tuning on AWS Graviton (https://pytorch.org/tutorials/recipes/inference_tuning_on_aws_graviton.html) provide a few tuning parameters for Arm.
For the benchmarking, we enabled bfloat16 fast math kernels on all platforms as shown below. On AWS Graviton3, this enables GEMM kernels that use bfloat16 MMLA instructions available in the hardware.
export DNNL_DEFAULT_FPMATH_MODE=BF16
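If you prefer to set this from Python rather than the shell, one option is to set the environment variable before PyTorch (and therefore oneDNN) is imported. The snippet below is a sketch of that approach, not part of the original benchmark script.

import os

# Must be set before importing torch so oneDNN picks it up when it initializes
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

import torch
from transformers import pipeline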
We used two reviews: a short review and a long review.
short_review: "I'm extremely satisfied with my new Ikea Kallax; It's an excellent storage solution for our kids. A definite must have."

long_review: "We were in search of a storage solution for our kids, and their desire to personalize their storage units led us to explore various options. After careful consideration, we decided on the Ikea Kallax system. It has proven to be an ideal choice for our needs. The flexibility of the Kallax design allows for extensive customization. Whether it’s choosing vibrant colors, adding inserts for specific items, or selecting different finishes, the possibilities are endless. We appreciate that it caters to our kids’ preferences and encourages their creativity. Overall, the boys are thrilled with the outcome. A great value for money."
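If you want to verify the token counts quoted earlier, a small sketch using BertTokenizer is shown below. It assumes the two reviews above are assigned to Python string variables named short_review and long_review.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Length of the tokenized input, including the special tokens the tokenizer adds
print(len(tokenizer(short_review)["input_ids"]))
print(len(tokenizer(long_review)["input_ids"]))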
We evaluated three NLP models (distilbert-base-uncased, bert-base-uncased, and roberta-base) using the sentiment analysis pipeline.
For each model, we measured the execution time for both the short and the long review. In the benchmark function, we performed a warm-up phase (running the pipeline 100 times) to ensure consistent results, then measured the execution time of each run and calculated the mean and 99th-percentile latencies.
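A minimal sketch of such a benchmark function is shown below. The helper name, the warm-up count of 100, and the number of measured runs are illustrative; this is not the exact script used to produce the published numbers.

import time
import numpy as np
from transformers import pipeline

def benchmark(model_name, text, warmup_runs=100, measured_runs=100):
    # Build a sentiment-analysis pipeline for the model under test
    pipe = pipeline("sentiment-analysis", model=model_name)

    # Warm-up phase to ensure consistent results
    for _ in range(warmup_runs):
        pipe(text)

    # Measured phase: record per-run latency in milliseconds
    latencies = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        pipe(text)
        latencies.append((time.perf_counter() - start) * 1000)

    return {"mean_ms": float(np.mean(latencies)),
            "p99_ms": float(np.percentile(latencies, 99))}

for model_name in ["distilbert-base-uncased", "bert-base-uncased", "roberta-base"]:
    for label, review in [("short", short_review), ("long", long_review)]:
        print(model_name, label, benchmark(model_name, review))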
With AWS Graviton3, you can add sentiment analysis that meets stringent real-time latency requirements to your existing application with just 4 vCPUs.
AWS Graviton3, built on the Arm Neoverse V1 CPU with ML-focused capabilities such as the bfloat16 MMLA extension, delivers outstanding inference performance for Hugging Face sentiment analysis PyTorch models.
Feel free to try it with your own models. Depending on your model, you might need to tune performance. For this purpose, the following resources will be useful:
PyTorch Install Guide on learn.arm.com (https://learn.arm.com/install-guides/pytorch/).
PyTorch Inference Performance Tuning on AWS Graviton Processors (https://pytorch.org/tutorials/recipes/inference_tuning_on_aws_graviton.html).